
<html>
<head>
<title>Data Mining</title>
</head>
<body BACKGROUND="stucco.jpg">
<h1 align=center>EMV - CS537 Project Proposal</h1>
<hr>
<br>
<center>
<h2>Classification in Data Mining</h2>
<br>
<h4>Eric Vitrano</h4>
</center>
<img align=center src="http://www.cs.cornell.edu/Info/People/vitrano/colorbar.gif">
<br>
<p>
<h3><u>Common Level</u></h3>
Class table
<ul>methods:
<ul>open(filename)<br>
	close(filename)<br>
	write_tuple(char *) /* sends in a char string of the whole record in
			ASCII */<br>
	read_tuple(tuple number) /* reads the tuple and returns a tuple
			instance */<br>
	get_scheme /* returns a char string listing the scheme */<br>
	set_scheme /* takes in a char string and sets the scheme of the
			table to that scheme */<br>
</ul></ul>
Class tuple
<ul>method:
<ul>	get_attribute(attribute number, location to copy the 
	attribute value to )
</ul></ul>

</p>
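<p>
The interface above might be sketched in C++ as follows. The class and method names come from the lists above; the record layout (comma-separated attribute values in one ASCII string) is an assumption, as is backing the table with an in-memory vector of rows loaded by open().
</p>

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Sketch of the tuple class: holds one record's attribute values.
class Tuple {
public:
    Tuple() {}
    explicit Tuple(const std::string &record) {
        // Assumption: attributes are comma-separated in the ASCII record.
        std::stringstream ss(record);
        std::string field;
        while (std::getline(ss, field, ','))
            attrs.push_back(field);
    }
    // get_attribute(attribute number, location to copy the attribute value to)
    bool get_attribute(size_t n, std::string &out) const {
        if (n >= attrs.size()) return false;
        out = attrs[n];
        return true;
    }
private:
    std::vector<std::string> attrs;
};

// Sketch of the table class: stores whole records as ASCII strings.
class Table {
public:
    bool open(const std::string &filename) {  // load one record per line
        std::ifstream in(filename.c_str());
        if (!in) return false;
        std::string line;
        while (std::getline(in, line)) rows.push_back(line);
        return true;
    }
    void close() { rows.clear(); }
    void set_scheme(const std::string &s) { scheme = s; }
    std::string get_scheme() const { return scheme; }
    void write_tuple(const char *record) { rows.push_back(record); }
    bool read_tuple(size_t n, Tuple &out) const {
        if (n >= rows.size()) return false;
        out = Tuple(rows[n]);
        return true;
    }
private:
    std::string scheme;
    std::vector<std::string> rows;
};
```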
<hr>
<p>
<h3><u>Data Mining Classifiers</u></h3>
<b>Stage 1</b>
<br>
Once the above groundwork is complete, I will implement a version of an elementary
data mining classification algorithm.  This algorithm will be based on the ID-3
decision tree model, with limited pruning.  A summary of the algorithm in pseudocode
form is as follows:

<ul>
	Tree Building<br>
<ul>		MakeTree (Training Data)<br>
		{
		   Partition (Training Data);
		}<br>

		Partition (Data)
		{
		   If all tuples in Data are in the same class - tree done.<br>
		   Else find the best split over all attributes (Split (Data)), and partition.<br>
		   Partition (each partition from above).
		}<br>
</ul>
	Tree Pruning<br>
<ul>		RemoveNode (Node)<br>
		{
		   For all (nodes in Node)<br>
			If (node in Node) has the same class value as its parent, remove it.
		}<br>
</ul>
	Split Evaluation<br>
<ul>		Split (Data)<br>
		{
		   For each attribute, calculate the goodness of the attribute;<br>
			return the attribute with the highest goodness.
		}<br>

		Split_Partition (Data)<br>
		{
		   Partition Data into two sets based on the goodness from Split.
		}<br>
</ul></ul>
<br>
The above algorithm will be implemented in Visual C++, with the goal of building a decision
tree that classifies tuples into defined classes.  The tree must first be trained on a training
set in which the classes of the tuples are known, and then tested on separate data to see
whether the returned classes are correct.  The results can then be used to direct queries on
incoming data, as well as to classify existing data.

</p>
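<p>
The "goodness" measure the Split step relies on is left unspecified above; ID-3 conventionally uses information gain, i.e. the reduction in class entropy obtained by partitioning on an attribute.  A minimal sketch follows, assuming categorical attribute values and a hypothetical Example record standing in for the tuple class.
</p>

```cpp
#include <cmath>
#include <map>
#include <string>
#include <vector>

struct Example {
    std::string attribute;  // value of the attribute under evaluation
    std::string cls;        // class label
};

// Entropy of a class-label distribution: H = -sum_c p_c * log2(p_c).
static double entropy(const std::map<std::string, int> &counts, int total) {
    double h = 0.0;
    for (std::map<std::string, int>::const_iterator it = counts.begin();
         it != counts.end(); ++it) {
        double p = static_cast<double>(it->second) / total;
        if (p > 0.0) h -= p * std::log2(p);
    }
    return h;
}

// Information gain of splitting the data on this attribute:
//   gain = H(S) - sum_v (|S_v| / |S|) * H(S_v)
double information_gain(const std::vector<Example> &data) {
    std::map<std::string, int> classCounts;                     // over all of S
    std::map<std::string, std::map<std::string, int> > byValue; // per value v
    std::map<std::string, int> valueTotals;                     // |S_v|
    for (size_t i = 0; i < data.size(); ++i) {
        classCounts[data[i].cls]++;
        byValue[data[i].attribute][data[i].cls]++;
        valueTotals[data[i].attribute]++;
    }
    double gain = entropy(classCounts, data.size());
    for (std::map<std::string, std::map<std::string, int> >::const_iterator
             it = byValue.begin(); it != byValue.end(); ++it) {
        int subtotal = valueTotals[it->first];
        gain -= (static_cast<double>(subtotal) / data.size())
                * entropy(it->second, subtotal);
    }
    return gain;
}
```

Split would call this once per attribute and keep the attribute with the highest gain.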
<p>
<b>Stage 2</b>
<br>
Once the general algorithm is complete, a second algorithm will be implemented.  It will
either be related to SLIQ, or will grow out of observations made while developing the
general case.

Possible areas of improvement include pruning on the fly, limiting both the searches over
the data and the amount of data that must be kept in memory, and presorting or partially
classifying the data.
<br><br>
</p>
<hr>
<p>
<h3><u>Time Estimates</u></h3>
I expect the following schedule to approximate my progress:
<br>
<h4>
<ul>
	October  21 - Completion of groundwork steps.
<br>	November  4 - Completion of the general algorithm.
<br>	November 25 - Completion of Stage 2 algorithm.
<br>	December  2 - Evaluation and further consideration of data mining classification.
</ul></h4>
</p>
<br>
<br>
<img align=center src="http://www.cs.cornell.edu/Info/People/vitrano/colorbar.gif">
<br>
<h5><a href="http://www.cs.cornell.edu/Info/People/vitrano/vitrano.html">EMV Home Page</a></h5>
</body>
</html>

